Dissertation Summary Recognizing Non-native Speech: Characterizing and Adapting to Non-native Usage in LVCSR

Author

  • Laura May
Abstract

Low-proficiency non-native speakers represent a significant challenge for large-vocabulary continuous speech recognition (LVCSR). Acoustic models are confused by a heavy accent; language models are confused by poor grammar and unconventional word choice. Lack of comfort with the spoken language affects the fundamental properties of connected speech that have been a focus of LVCSR research; cross-word and interword coarticulation, disfluency, and prosody are among the features that differ in native and non-native speech. In this dissertation, I first address the problem of characterizing low-proficiency non-native speech. One population is examined in great detail: learners of English whose native language is Japanese. Properties such as fluency, vocabulary, and pace in read and spontaneous speech are measured for both general and proficiency-controlled data sets. I further show that native and non-native speech can be distinguished using a variety of statistical metrics, including perplexity and Kullback-Leibler divergence. Patterns in reading errors and grammaticality of spontaneous speech are quantitatively described. This analysis, while focusing on one speaker population, provides a model for characterizing non-native speech that the broader LVCSR community may find useful. The generalizability of this model is demonstrated by contrasting the speech of native speakers of Mandarin with that of our primary speaker set.

Second, I explore methods of adapting to non-native speech. The test set is controlled for language exposure and proficiency, and the task is a simplified read news task tailored toward the lower-proficiency speakers, who experienced limited success in more difficult reading tasks like the widely-used Wall Street Journal readings. I find that the largest gains in recognition performance come through acoustic adaptation, and present evaluations of adaptation and training techniques incorporating native-language and accented data.
From a speaker-adapted baseline of 63.1% (the same models perform at 8% for Broadcast News F0 speech), a 29% relative improvement is achieved through a combination of adaptation and training. In contrast, gains from lexical modeling were found to be extremely small, even when investigated in conjunction with retraining. I describe data-driven and linguistically-motivated algorithms for lexical modeling, presenting experimental results and discussing possible reasons why the improvement was not larger.

Finally, I present a novel method for detecting non-native speech. Without using any acoustic features, I show how bilateral and multilateral discrimination can be accomplished on the basis of features present in text. Both recognizer output and transcripts of non-native speech are identified with high accuracy through naive Bayes classification. The word and part-of-speech sequences that are found to be indicative of non-native speech provide an additional resource for characterizing non-native speech, which leads to further insights about the properties of non-native spoken language.

1 Introduction

Not so many years ago, dialogues like the following were the target of intensive basic research in speech recognition and natural language understanding.

System: Welcome to Pittsburgh Airlines' automated flight information service. Do you know the flight number?
User: Five-twelve.
System: Would you like arrival or departure information?
User: Arrival.
System: Flight 524 arriving on Tuesday, Feb 27th. Is that correct?
User: No.
System: What would you like to change, the flight number or the date?
User: The flight number.
System: Please say the flight number.
User: Five one two.
System: Flight 512 arriving on Wednesday, Feb 28th. Is that correct?
User: Yes.
System: Flight 512 arrives in more than one city. Would you like Boston Logan or Charleston, SC?
...

Today, at least one major airline provides an automated interaction exactly like this as the front-line response to caller flight information queries.
Clearly, speech recognition has begun to be embraced by industry, which sees it as an efficient replacement for human operators, and tolerated, however grudgingly, by users. We can expect to see more and more applications of speech recognition in the real world as the technology improves and public acceptance spreads. With real-world deployment, however, user diversity demands serious consideration. Although the claim that all users should be offered equal access to speech recognition is not strong (people with poor reading skills do not have the same access to newspapers as the highly literate, for example), it does appear that recognition applications are on a trajectory to become a default interface for information dissemination systems. Accommodation of users whose language skills are somehow impaired is both an interesting research problem and a significant functional concern.

In the dialogue transcribed above, I, a native speaker, called up to make the query. Although the flight number was misrecognized at first, overall, the recognition quality was good, and I was able to get the information I required. I also asked a friend, a native speaker of Japanese who had lived in the United States for two years, to call for the same information.

System: Welcome to Pittsburgh Airlines' automated flight information service. Do you know the flight number?
User: Yes.
System: Please say the flight number.
User: H five hundred and twent- five hundred and twelve.
System: Would you like arrival or departure information?
User: Arrival.
System: Flight 283 arriving on Tuesday, Feb 27th. Is that correct?
User: ? No flight five hundred and twelve please
System: What would you like to change, the flight number or the date?
User: O? So... flight number is five hundred and twenty no five hundred and twelve
System: Please say the flight number.
User: Flight number is five hundred and twelve.
System: Flight 94 arriving on Wednesday, Feb 28th. Is that correct?
User: No.
System: What would you like to change, the flight number or the date?
User: Flight number, please.
System: Please say the flight number.
User: Five hundred and twenty.
System: Flight 520 arriving on Wednesday, Feb 28th. Is that correct?
User: Yes.
System: Flight 520 arrives in more than one city. Would you like Chicago O'Hare or Kansas City, MO?
User: , no, no. Not twenty, twelve.
System: Please repeat or rephrase that.
User: ... ?
System: Please repeat or rephrase that.
User: <click>

In this case, the speaker is prevented from finding out what she wants to know by a combination of recognition errors and her difficulty responding appropriately to the system's prompts. One can envision other situations in which reduced English proficiency would diminish the effectiveness of speech-driven applications. Dictation systems, for example, are used by people ranging from physicians recording patient information to graduate students with typing injuries. Both of these groups have significant non-native populations. Conversational transcription systems such as meeting record and surveillance systems cannot assume that all subjects will be fluent speakers of the language. If a speech translation system is available to facilitate English-Japanese communication, it may be used not only by native English speakers but also by the many others who speak English better than they speak Japanese. Language learning systems are limited in their ability to offer recognition-based lessons because speech recognition of new learners is not reliable.

It seems clear that native speakers are able to identify non-native speakers based on features like accent, syntax, and fluency. Children can pinpoint and imitate specific characteristics of speech that mark it as typical of a non-native group. When a listener is first exposed to a variety of non-native speech, he may initially struggle to understand it, but if he is a cooperative listener, he can often adapt very quickly.
Humans are incredibly well equipped to understand speech, and tolerate deviation relatively well. Unfortunately, neither of these skills has come as naturally to the machine. Computer understanding of speech is based on statistical models of patterns found in training corpora. When the accent, syntax, and lexical choice of the speaker are not well represented in the corpus, the models must somehow be adapted if good recognition is to be achieved. We might imagine several angles for attacking such adaptation.

The acoustic model specifies the expected mapping of acoustic events to phonetic units. In a fully-continuous context-dependent system such as the one used in this dissertation, this is an extremely fine-grained representation. Acoustic events are modeled on a sub-phonetic level, and tens of times as many variations are recognized as would be in a traditional phonetic analysis. The acoustic model would be the natural place to represent phonetic differences in realization for a given speaker's accent.

The lexicon, which describes the phonemic makeup of words, would lend itself to modeling of phonemic differences and phonological simplification in production. By altering the lexeme specifications, phonemic substitutions, epenthesis, elision, and in some cases phonetic realizational differences can be easily represented. The problem that arises is that the altered lexicon may not interact with the acoustic model as expected. However, lexical modeling is a straightforward approach that has been used with success for native speech (Humphries and Woodland, 1997; Huang et al., 2000) and non-native speech for non-LVCSR tasks (Fung and Liu, 1999).

The recognizer's understanding of how words fit together is encoded in the language model. Absent a natural language understanding component, the recognizer has no understanding of the meaningfulness of a hypothesized utterance, and must rely on a statistical model to determine the likelihood of a sequence of words having been uttered.
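
To make concrete what a statistical model that scores word sequences can look like, here is a minimal sketch of a bigram language model with add-one smoothing, together with a perplexity computation. It is only an illustration and is not the language model used in the dissertation. The function names (`train_bigram`, `perplexity`) and the three toy training sentences are assumptions invented for this example.

```python
import math
from collections import Counter

# Hypothetical toy "training corpus" of tokenized sentences (not from the dissertation).
NATIVE_LIKE = [
    ["i", "would", "like", "arrival", "information"],
    ["i", "would", "like", "departure", "information"],
    ["the", "flight", "number", "is", "five", "one", "two"],
]


def train_bigram(sentences):
    """Count unigrams and bigrams, adding sentence-boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def perplexity(words, unigrams, bigrams):
    """Perplexity of one sentence under an add-one-smoothed bigram model."""
    tokens = ["<s>"] + words + ["</s>"]
    vocab_size = len(unigrams) + 1  # one extra slot reserved for unseen words
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    # Normalize by the number of predicted tokens (including </s>).
    return math.exp(-log_prob / (len(tokens) - 1))


if __name__ == "__main__":
    uni, bi = train_bigram(NATIVE_LIKE)
    print(perplexity(["i", "would", "like", "arrival", "information"], uni, bi))
    print(perplexity(["information", "arrival", "like", "would", "i"], uni, bi))
```

A word order the model has seen in training receives a lower perplexity than the same words in an unfamiliar order, which is the sense in which such a model restricts the word sequences a recognizer considers probable.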
By adapting the language model, the restrictions on probable word sequences could be relaxed for increased tolerance of deviation from native patterns of speech. Alternatively, one could envision training a statistical model of non-native speech, representing explicitly patterns that are common in the speech of non-natives.

Finally, the system itself could be adapted for greater flexibility in processing non-native speech. Just as human listeners are able to ask the speaker to repeat himself, delay processing while building context, and silently induce lexical, syntactic, and phonetic mappings from both positive and negative examples, a system that endeavors to understand non-native speech could incorporate learning strategies with the aid of dialogue and natural language understanding components. This investigation will be restricted to the recognizer components that model pronunciation, namely the acoustic model and the lexical model.

In this dissertation, I concentrate principally on native speakers of Japanese. This speaker population offers great potential for experimental control; English education is standardized in Japan, and the Japanese population in Pittsburgh is large enough that finding speakers with similar educational backgrounds and exposure to English was not difficult. The nature of Japanese-influenced English is also well studied, from both lexical and phonotactic points of view. The many English words that have worked their way into everyday Japanese speech have undergone semantic and phonological transformations that can help us to predict how Japanese natives will approach production of English. Because nativized foreign words are represented in the Japanese script, an array of orthographic mappings are accessible that may provide further aid in developing a model of Japanese-influenced English. Applications of this work are also likely to be of interest in Japan.
Language tutoring systems that model a particular native language (L1) well can present feedback in the context of linguistic elements that are known to be problematic for speakers that share the user's L1. The Japanese government is currently so concerned about the English ability of its citizens that it is considering the dramatic step of making English an official language (Kawai, 2000). Such a requirement would increase the demand for English training, and possibly for English versions of natural language systems currently available in Japanese. In such an eventuality, tolerance of non-native English would be critical.

2 Non-native Speech Database: Composition and Characterization

The differences between native and non-native speech can be quantified in a variety of ways, all relevant to the problem of improving recognition for non-native speakers. Differences in articulation, speaking rate, and pause distribution can affect acoustic modeling. Differences in disfluency distribution, word choice, syntax, and discourse style can affect language modeling. And, of course, as these components are not independent of one another, all affect overall recognizer performance. Although understanding how native and non-native speech differ at all levels is clearly an important first step in attacking the problem of non-native recognition, we have seen few detailed studies contrasting native and non-native speech patterns with respect to features that are important to LVCSR. In this chapter, I provide such an analysis, describing the differences between the native and non-native speech samples I have collected and the methods used to quantify them. This analysis is important for speech recognition, but has implications for other areas of natural language processing such as parsing and discourse processing.

2.1 Database composition

Spoken data was collected primarily from native speakers of Japanese, with a few native speakers of Mandarin included for comparison.
English proficiency of all speakers was evaluated using SPEAK, a standardized evaluation procedure developed by the Educational Testing Service as part of the Test of English as a Foreign Language (TOEFL) program (SPE, 1987; Clark and Swinton, 1979). Using recruitment and elicitation strategies that were refined during pilot data collection experiments, a total of 58 native Japanese, 8 native Mandarin, and 10 native English speakers were recorded completing read and spontaneous tasks in English. The read task involved reading aloud from the Children's News Database (CND), a collection of news articles written for children covering current events in the years 1998-2000. This database was selected over more common databases such as Wall Street Journal because of the extreme difficulty speakers had in reading adult-oriented texts. In the spontaneous task, speakers were prompted for queries in the tourist domain. In addition, speakers read from the story of Snow White. A breakdown of this database is given in Table 1.

This dissertation focuses on read speech and lower-proficiency speakers. For the recognition experiments that will be described in later sections, a test set of 10 native Japanese speakers was defined meeting these criteria. The analysis presented in this section, however, covers all speakers, of both higher and lower proficiency, and in read and spontaneous tasks.

                   Prompted                    Story                       News (CND)
Native language    # speakers  # utterances    # speakers  # utterances    # speakers  # utterances
Japanese           33          2257            13          795             31          3802
English            6           436             6           548             1




Publication date: 2001